ABSTRACT:

A pipelined scheduler that allows easy implementation and control, and that schedules fairly among
the input lines of a high speed crossbar switch fabric, is discussed. By framing the time axis and
using a priority matrix to reserve future time slots, a round-robin communication scheme yields a
systematically ordered sequence of visits to time slots regardless of whether the number of
scheduler modules is even or odd. Further, a Carry Over Round-robin Pipelined Scheduler (CORPS)
achieves scalability to a large number of ports. Moreover, CORPS achieves one scheduling decision
per line per slot by scheduling packets in future slots. It is well suited to the support of
Quality of Service traffic, since the choice of the queues to be scheduled is arbitrary. CORPS
limits itself to resolving, in a fair way, the contention for output ports.
(54) A pipelined packet scheduler for high speed optical switches

Description

CROSS REFERENCE TO RELATED APPLICATION: 

[0001] This application is a continuation-in-part of pending Application No. 09/335,908, filed June 18, 1999.

BACKGROUND OF THE INVENTION

1. Field of the invention

[0002] The present invention relates to network systems and switches that control the flow of data around the
network, and more particularly to schedulers that manage the flow of data through high capacity switches.

2. Description of the related art 



[0003] Input queue switch architecture has always been an attractive alternative for high speed switching systems,
mainly because the memory access speed of the input buffers scales with the speed of a single input line, not with
the total switching capacity. However, an input buffered switch has long been known to suffer from head-of-line (HOL)
blocking, which puts a theoretical limit of 58.6% on its total throughput. See M. J. Karol, M. G. Hluchyj,
S. P. Morgan, "Input Versus Output Queuing on a Space-Division Packet Switch", IEEE Transactions on Communications,
Vol. COM-35, No. 12, pp. 1347-1356, Dec. 1987.

[0004] More recently, an input queuing technique called Virtual Output Queuing (VOQ) has been proposed to
overcome the head-of-line blocking problem of input queued switches. See Y. Tamir and G. Frazier, "High Performance
Multi-queue Buffers for VLSI Communication Switches", Proceedings of 15th Ann. Symp. on Comp. Arch., pp. 343-354,
June 1988, and T. Anderson, S. Owicki, J. Saxe, C. Thacker, "High Speed Switch Scheduling for Local Area Networks,"
ACM Transactions on Computer Systems, pp. 319-352, Nov. 1993. The idea is to keep, at each input port, a separate
queue for each output port of the switch. This eliminates the possibility that a packet destined to an available
output port is blocked by a head-of-line packet which cannot proceed due to contention for a different port. Thus,
an N x N switch has N queues per input port, or N² queues in total. As discussed by others (A. Mekkittikul,
N. McKeown, "A Practical Scheduling Algorithm to Achieve 100% Throughput in Input-Queued Switches", Proceedings of
Infocom98, April 1998), further exploration of the VOQ technique has shown that 100% throughput is indeed achievable
through the design of smart schedulers.
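
As an illustration of the VOQ idea, the following minimal Python sketch models the queues of a single input port.
The class name `VOQInputPort`, its methods, and the packet labels are hypothetical, introduced here for
illustration only; they are not taken from the cited references.

```python
from collections import deque


class VOQInputPort:
    """Hypothetical model of one input port keeping one FIFO per output port."""

    def __init__(self, num_outputs):
        self.queues = [deque() for _ in range(num_outputs)]

    def enqueue(self, packet, output_port):
        # Packets are separated by destination as they arrive.
        self.queues[output_port].append(packet)

    def backlogged_outputs(self):
        # Output ports for which this input currently holds at least one packet.
        return [p for p, q in enumerate(self.queues) if q]

    def dequeue(self, output_port):
        return self.queues[output_port].popleft()


port = VOQInputPort(num_outputs=4)
port.enqueue("pkt-A", output_port=2)  # arrives first; output 2 is assumed busy
port.enqueue("pkt-B", output_port=0)  # arrives second; output 0 is free

# With a single FIFO, pkt-B would wait behind pkt-A. With VOQs it can be
# served immediately, because it sits in a different queue.
busy_outputs = {2}
eligible = [p for p in port.backlogged_outputs() if p not in busy_outputs]
served = port.dequeue(eligible[0])
```

Here `served` is the packet for output 0, although it arrived after the packet for the busy output 2.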

[0005] Schedulers for VOQ input buffered switches thus become one of the key design points of a high speed
input buffered switch. With VOQ, a scheduler has many more choices for switching packets from backlogged input ports
to output ports than in a regular First-In-First-Out (FIFO) input queuing architecture: any input/output pair of
ports can be selected among the backlogged input ports. Most work devoted to such schedulers can be classified
as follows. Centralized schedulers are those for which the scheduler is a single entity, which has information about
all N² VOQs and makes a scheduling decision about all possible input/output pairs of ports per packet slot. See, for
example, A. Mekkittikul, N. McKeown, "A Practical Scheduling Algorithm to Achieve 100% Throughput in Input-Queued
Switches", Proceedings of Infocom98, April 1998. Distributed schedulers, on the other hand, are those for which the
scheduler is partitioned into functional blocks, usually one or two blocks per input or output port, or even one
block per input/output cross point. See, for example, N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick,
M. Horowitz, "The Tiny Tera: A Packet Switch Core", IEEE Micro, Jan/Feb 1997, pp. 26-32, and Y. Tamir and H-C Chi,
"Symmetric Crossbar Arbiters for VLSI Communication Switches", IEEE Transactions on Parallel and Distributed Systems,
Vol. 4, No. 1, pp. 13-27, 1993.

[0006] Centralized schedulers require access to N² pieces of information before scheduling decisions can be
made. Such schedulers are generally not scalable, in the sense that the hardware to implement them is highly
dependent on the number of switch lines N. Fig. 1 illustrates one such scheduler. Distributed schedulers have
the potential to make the hardware more independent of the number of switch ports. However, the ones proposed so
far still require a communication mechanism which provides information about all N² queues before a scheduling
decision can be made for a particular packet slot. This communication can take place either in a parallel way (as in
the SLIP scheduler; see N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, M. Horowitz, "The Tiny Tera: A Packet
Switch Core", IEEE Micro, Jan/Feb 1997, pp. 26-32), or in a round-robin way (see Y. Tamir and H-C Chi, "Symmetric
Crossbar Arbiters for VLSI Communication Switches", IEEE Transactions on Parallel and Distributed Systems, Vol. 4,
No. 1, pp. 13-27, 1993). Both architectures are shown in Fig. 2. The parallel communication architecture makes each
block explicitly dependent on the size of the switch, since each block must receive N² messages. The round-robin
architecture overcomes this problem, but creates another one: in order to achieve a scheduling decision about all
output ports, the message passing must complete a full round within a single packet slot. This requires message
processing at least N times faster than the scheduling decisions.

[0007] More recently, a Round-Robin Greedy Scheduler (RRGS) was proposed: a scheduler based on message
passing, in which each input port makes a scheduling decision and passes this information, in a round-robin fashion,
to its next neighbor. See A. Smiljanic, R. Fan, G. Ramamurthy, "RRGS-Round-Robin Greedy Scheduling for
Electronic/Optical Terabit Switches", NEC C&C Research Laboratories, Technical Report TR 98-C063-4-5083-2, 1998.
See also co-pending U.S. Application 09/206,975. In order to reduce message passing speed requirements, RRGS
introduces a pipeline feature. Input ports make scheduling decisions about future slots, far enough into the future
to allow enough time for the message passing mechanism to disseminate this information among the other input ports.
RRGS can provide high speed scheduling.

[0008] Before engaging in the description of the present invention, the general pipelined scheduler architecture
will be discussed. For the switch architecture, it is assumed that the scheduling is applied to a pure non-blocking
N x N crossbar switch. It is also assumed that Virtual Output Queues (VOQs) are used to take care of the HOL blocking
problem. Fig. 3 shows one such switch.

[0009] In addition, fixed size packets and uniform link speeds are assumed. Time is slotted, where a slot is defined
to be the time taken for the transmission of a single packet by an output link. A non-blocking crossbar can thus
switch up to N packets per time slot, if no output port contention exists. The basic task of the scheduler is to
determine which VOQs, among those which are non-empty, will have access to the output ports, on a per slot basis.
For efficiency, the scheduler must resolve all contentions among the backlogged queues within a single time slot.

[0010] As line speeds continue to grow, it is paramount that scheduling algorithms be scalable to large capacity
switches. Therefore, a distributed architecture seems attractive, since it alleviates the tight processing time
required for packet scheduling in a high speed switch. For instance, for a 16 x 16 port switch with a 10 Gbit/s line
speed, scheduling decisions must be made at each packet transmission time, 42 ns for a 424 bit ATM cell. If a
sequential scheduler is used, each decision must be made in less than 0.16 ns for a 16 x 16 switch, since N²
decisions must be made. If an optical core is used, it makes sense to distribute the electronic hardware on a per
port basis, leaving the total switching bandwidth requirement for the optical core. Moreover, a distributed scheduler
should naturally scale to any number of lines. Fig. 4 illustrates such a scheduler.
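
The timing figures quoted above can be reproduced with a few lines of Python. The constants are the ones stated in
the text; the variable names are chosen here for illustration.

```python
LINE_RATE_BPS = 10e9  # 10 Gbit/s line speed (from the text)
CELL_BITS = 424       # size of an ATM cell in bits (from the text)
N = 16                # number of ports on each side of the switch

# One slot is the transmission time of a single cell on one line.
slot_time_ns = CELL_BITS / LINE_RATE_BPS * 1e9

# A sequential scheduler must make N * N decisions within one slot.
per_decision_ns = slot_time_ns / (N * N)

print(f"slot time: {slot_time_ns:.1f} ns")
print(f"time per sequential decision: {per_decision_ns:.3f} ns")
```

The slot time comes out at about 42.4 ns and the per-decision budget at roughly 0.166 ns, in line with the rounded
values of 42 ns and 0.16 ns given above.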

[0011] Each crossbar input port has an Input Port Scheduler Module (SM). Each SM has a distinct ID, SM-ID. In
order to maintain scalability with the number of lines, an SM is allowed to communicate with a single immediate
neighbor only. This ensures that the SM hardware block can be used with any N x N crossbar fabric. The SM
communication chain is shown in Fig. 4. It is used to communicate scheduling information, such as time slot, slot
ownership, and output port reserved. The only interaction between the crossbar module and the SMs is via a global
clock, which tells every SM which slot is the Current system Time Slot (CTS), and via the current decision table,
which holds the pairs of input/output ports to be switched at CTS (not shown in the figure). The table can be
implemented as a global memory, written by the schedulers and read by the crossbar fabric.

[0012] For every time slot, each SM is supposed to have complete freedom to choose the output port to which it
requests access. SMs making the same choice generate what is hereafter called a collision, which needs to be resolved
before a global scheduling pattern can be determined for a given slot. If an SM is to have current information about
all other requests, the communication chain must operate at a speed N times faster than the speed of scheduling
decisions. Namely, SMs would need to be able to receive N messages before making a single scheduling decision. In
order to keep the speed of the SM hardware scalable with the line speed, an N look ahead scheduling scheme may be
employed. Namely, each SM makes a scheduling decision about a time slot that is at least N slots ahead of the
current slot. This feature ensures that an SM knows about the other SMs' scheduling decisions already made for the
same slot before making its own scheduling decision. Moreover, this feature comes without the need for speeding up
the communication chain to N times the input line speed. As described above, RRGS has the features of distributed
scheduling, pipeline scheduling, and N look ahead scheduling.

[0013] Fig. 5 is a time chart showing an example of RRGS scheduling employed in a 4 x 4 crossbar switch, more
specifically showing the relationship between four SMs (SM1-SM4) and future time slots T6, T7 at which each of
SM1-SM4 reserves an output port for its own input.

[0014] For example, at time slot T5 of Fig. 5, SM1 performs the scheduling of future time slot T10, that is, chooses
an output port for transmission at future time slot T10, and SM3 performs the scheduling of future time slot T9. At
time slot T6, following T5, SM1 performs the scheduling of future time slot T8, and so on.

[0015] In this way, each SM performs the scheduling and then transfers the resultant schedule to the next SM,
ensuring that each SM receives from the previous SM, in a timely manner, scheduling information about output ports
which have already been scheduled. Therefore, if an SM avoids choosing output ports which have already been picked
by previous "visitors", collisions can be completely avoided.

[0016] According to RRGS, however, the sequence of time slots for which an SM picks output ports becomes
complicated.

[0017] In Fig. 6, more specifically, the respective sequences of time slots for SM1-SM4 are shown, which are
obtained by converting the time chart of Fig. 5 into a form suitable for a sequence of visits to time slots for each
SM. For SM1, for instance, the sequence of time slots is T10, T8, T11, T9, ..., which is arranged neither in time
sequence nor in reverse time sequence. This causes the implementation and control of RRGS to become complicated.

[0018] Further, RRGS performs different scheduling operations depending on whether the number of SMs is even
or odd (see A. Smiljanic, R. Fan, G. Ramamurthy, "RRGS-Round-Robin Greedy Scheduling for Electronic/Optical Terabit
Switches", NEC C&C Research Laboratories, Technical Report TR 98-C063-4-5083-2, 1998). Therefore, each
time an SM is added, the control must be changed, resulting in complicated implementation and control.

[0019] Furthermore, SMs are restricted to picking output ports which have not yet been chosen. Therefore, VOQ
service rates become difficult to predict, and a serious fairness problem arises. Assume in Fig. 4 that SMs #1 and #2
have their queues for a given output port constantly backlogged, while the other SMs have their corresponding queues
empty. In this case, three out of four slots will be picked by SM #1 in Fig. 5, since it visits three out of
the four slots prior to SM #2 in the sequence of visits as defined in Fig. 5 (see A. Smiljanic, R. Fan,
G. Ramamurthy, "RRGS-Round-Robin Greedy Scheduling for Electronic/Optical Terabit Switches", NEC C&C Research
Laboratories, Technical Report TR 98-C063-4-5083-2, 1998).

[0020] As described above, although the RRGS scheduler can advantageously achieve high-speed scheduling, it has
the disadvantages that its implementation and control become complicated and that predictable and adjustable service
rates cannot be realized. As discussed above, there is also a problem of fairness: some of the VOQs may be prevented
from being scheduled because of the states of the other VOQs.



SUMMARY OF THE INVENTION



[0021] An object of the present invention is to provide a fundamental architecture for a scheduler which allows
simplified implementation and control of RRGS.

[0022] Another object of the present invention is to provide a scheduler for a high capacity switch that allows
complete freedom of choice as to which VOQ to attempt to schedule.

[0023] Still another object of the invention is to provide a scheduler that makes VOQ service rates both more
predictable and adjustable.

[0024] A further object of the invention is to provide a scheduler that is also a fair scheduler, in the sense
that any VOQ has the same chance of being scheduled, regardless of the state of the other VOQs.

[0025] One extra constraint in the scheduler design is that the decision of which among the N VOQs belonging to
a given input line is to be scheduled next be out of the scheduler's control. In other words, an external entity has
total freedom to decide which output port should be attempted next, on a per input port basis. This requirement
is paramount to the future support of Quality of Service (QoS). It is clear that this may reduce the maximum
throughput in favor of a more predictable service rate of VOQs. This is an important point, however, since the
maximization of the overall switch throughput may lead to starvation of some of the queues, and consequently of the
flows associated with them.

[0026] According to the first aspect of this invention, a switch is provided for controlling the flow of data in a
network, having input ports, output ports, and a scheduler having a plurality of input port schedule modules. Each
schedule module schedules a particular input port of said input ports for sending data to a designated output port of
said output ports. The schedule modules pass a scheduling message from module to module, and each schedule module
computes a future time slot for which that schedule module will attempt to access the designated output port. The
module determines if said future time slot is valid based on whether the future time slot is currently reserved by
the current schedule module, whether the future time slot is blocked, or whether the future time slot is taken by
another schedule module. The schedule module takes the future time slot if valid and enters information into the
scheduling message indicating that the future time slot is taken.

[0027] In another embodiment, the scheduler of the switch advances the future time slot by a predetermined
number of time slots when the future time slot has been reserved or taken.

[0028] In another embodiment, the switch queues data input through said input ports using virtual output queuing
that maintains separate queues for each of said output ports. Alternatively, the virtual output queuing for a
particular port may be independent of the virtual output queuing for the other ports. Additionally, the switch has
service rates of the virtual output queuing that are both predictable and adjustable.

[0029] In another embodiment, the switch scheduler selects the designated output port based on a weighted round
robin.

[0030] According to another aspect of this invention, a method is provided for scheduling input packets arriving at
input ports of a switch to be sent to output ports of the switch, where the scheduler has a plurality of input port
schedule modules. The steps of the method include:

a) receiving a scheduling message from a previous schedule module by a current schedule module;
b) computing a future time slot for which said current schedule module will attempt to access one of said output
ports;
c) selecting one of said output ports to schedule for transmission at said future time slot;
d) determining whether said future time slot has been previously reserved by said current schedule module;
e) determining whether said future time slot is blocked, when said future time slot has not been previously reserved;
f) determining whether said future time slot was previously taken by another schedule module, when said future time
slot is not blocked;
g) determining whether a carry over operation was previously started from said scheduling message, when said future
time slot is taken by another schedule module or has been previously reserved by said current schedule module;
h) setting said future time slot to be blocked and returning to step d), when said carry over operation was
previously started;
i) advancing said future time slot by a predetermined number of time slots, setting a carry over flag and returning
to step d), when said carry over operation was not previously started;
j) taking said future time slot and entering information into said scheduling message indicating that said future
time slot is taken, when said future time slot has not previously been taken by another schedule module; and
k) passing said scheduling message to a next schedule module.

[0031] In another aspect of the invention, the data input through the input ports is queued using virtual output
queuing that maintains separate queues for each of the output ports.

[0032] In another aspect of the invention, the virtual output queuing for a particular port is independent of said
virtual output queuing for the other ports.

[0033] In another aspect of the invention, the service rates of said virtual output queuing are both predictable and
adjustable.

[0034] In another aspect of the invention, the scheduler selects said designated output port based on a weighted
round robin.

[0035] According to another aspect of the invention, a switch for controlling a flow of data in a network includes: a
plurality of input ports; a plurality of output ports; and a scheduler having a plurality of input port schedule
modules, to schedule a particular input port of said input ports for sending data to a designated output port of said
output ports. The schedule modules are connected in a ring and, at each time slot, each of the schedule modules
receives reservation information from a previous schedule module, determines whether a future time slot is permitted
to be reserved for said schedule module to send data, and sends reservation information including its own reservation
for a future time slot to a next schedule module.

[0036] According to another aspect of the present invention, a method for scheduling input signals arriving at input
ports of a switch to be sent to output ports of said switch having N input port schedule modules comprises the steps
of: a) setting a sequence of frames, each of the frames consisting of N time slots; and b) scheduling the input
signals in a current frame so that the input signals are sent to the output ports in a next frame following the
current frame.

[0037] The step b) may include the steps of: b.1) receiving a scheduling message from a previous schedule module
by a current schedule module; b.2) computing a future time slot for which said current schedule module will attempt
to access one of said output ports, wherein said future time slot is included in the next frame; b.3) selecting one
of said output ports to schedule for transmission at said future time slot; b.4) determining whether said future time
slot has been previously reserved by another schedule module; b.5) taking said future time slot and entering
information into said scheduling message indicating that said future time slot is taken, when said future time slot
has not previously been taken by another schedule module; and b.6) passing said scheduling message to a next schedule
module.

[0038] In another aspect of the invention, the step b) comprises the steps of: simultaneously starting scheduling
decision processes of the N input port schedule modules at the beginning of each frame; simultaneously performing the
scheduling decision processes using a pipelined approach in said frame; and simultaneously completing the scheduling
decision processes at the end of said frame.

[0039] In another aspect of the invention, in the step b), scheduling decision processes of the N input port schedule
modules are simultaneously performed in said current frame, wherein the N input port schedule modules make
scheduling decisions for different time slots of said next frame.

[0040] In another aspect of the invention, in the step b), the input signals in said current frame are scheduled to
be sent to the output ports in said next frame by referring to an N x N matrix which defines an ordered sequence of
the N input port schedule modules to visit a given time slot in the future.

[0041] Additional objects and advantages of the invention will be set forth in the description that follows, and in
part will be obvious from the description, or may be learned through practice of the invention. The objects and
advantages of the invention may be realized and obtained by means of the instrumentalities and combinations
particularly pointed out in the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0042]

Fig. 1 is a diagram illustrating a centralized VOQ scheduler;

Figs. 2A and 2B are diagrams illustrating two distributed scheduler architectures;

Fig. 3 is a diagram illustrating an input buffered switch architecture;

Fig. 4 illustrates the organization of an input port distributed scheduler;

Fig. 5 is a schematic representation of pipeline scheduling decisions according to RRGS;

Fig. 6 is a schematic representation showing the respective sequences of visits to time slots for SM1-SM4, which
are obtained by converting the time chart of Fig. 5 into a form suitable for the sequence of time slots for each SM;

Fig. 7 shows an example priority matrix used to resolve collisions according to a first embodiment of the present
invention;

Fig. 8 is a schematic representation of the pipeline scheduling decisions in the case of employing the priority
matrix of Fig. 7;

Fig. 9 shows another example priority matrix used to resolve collisions according to the first embodiment;

Fig. 10 is a schematic representation of the pipeline scheduling decisions in the case of employing the priority
matrix of Fig. 9;

Fig. 11 is a diagram depicting the carry over operation among SMs according to a second embodiment of the
present invention;

Fig. 12 illustrates a format for the S-message;

Fig. 13 illustrates a format for the SM data structure;

Fig. 14 illustrates a flow chart for the CORPS scheduling algorithm according to the second embodiment;

Fig. 15 illustrates the CORPS VOQ queuing model;

Fig. 16 is a graph presenting packet delays as a function of system load;

Fig. 17 illustrates the complementary delay distribution of a 16 x 16 switch equipped with a CORPS scheduler;

Fig. 18 illustrates a block diagram of a CORPS controller; and

Fig. 19 is a graph showing expected delay versus system load for various competitive schedulers.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0043] According to a first aspect of the present invention, the time axis, discussed above, is further divided into
slot frames, each of which is simply a sequence of N consecutive slots. Thus, time can be regarded as a sequence of
frames. In order to establish a criterion to resolve collisions among competing SMs, a priority matrix is used. An
N x N priority matrix is a matrix which defines an ordered sequence of SMs to visit a given time slot in the future.
The row of the matrix indexes a slot in the current frame, which is the frame containing the current system slot. The
column of the matrix indexes the slot in the next frame to be visited. An element of the matrix specifies which SM
should "visit" the slot in the next frame dictated by the column index.

[0044] Fig. 7 shows a 4 x 4 priority matrix and Fig. 8 shows a pipelined sequence of visits to time slots for the
priority matrix of Fig. 7. It is noted that a pipelined decision process is already contained in the use of a
priority matrix. For instance, when the current time slot of the system is the second slot of the current frame,
SM #3 is making a scheduling decision about the second slot of the next frame, while SM #1 is making a scheduling
decision about the fourth slot of the next frame.

[0045] The priority matrix allows the time axis to be framed, resulting in a systematically ordered sequence of
visits to time slots in each frame. In the frame F1, for example, the respective scheduling decision processes of the
SMs are simultaneously started at the beginning of the frame F1; the ordered sequence of time slots for which each SM
makes scheduling decisions is T8 -> T7 -> T6 -> T5 -> T8; and the respective scheduling decision processes of the SMs
are simultaneously completed at the end of the frame F1. Compared with the sequence of visits to time slots according
to RRGS (T10, T8, T11, T9, ..., as shown in Fig. 6), the sequence according to the present invention is
systematically ordered. Therefore, it is easy to implement and control the SMs.

[0046] Further, an N x N priority matrix defines a systematically ordered sequence of visits to time slots in the
same manner regardless of whether the number of SMs is even or odd.

[0047] Fig. 9 shows a 5 x 5 priority matrix and Fig. 10 shows a pipelined sequence of visits to time slots for the
priority matrix of Fig. 9. Similarly to the case of N=4, the sequence of visits to time slots is systematically
ordered. In the frame F1, for example, the respective scheduling decision processes of the SMs are simultaneously
started at the beginning of the frame F1; the ordered sequence of time slots for which each SM makes scheduling
decisions is T10 -> T9 -> T8 -> T7 -> T6 -> T10; and the respective scheduling decision processes of the SMs are
simultaneously completed at the end of the frame F1. Since an N x N priority matrix defines a systematically ordered
sequence of visits to time slots in the same manner regardless of whether the number of SMs is even or odd, easier
implementation and control of SMs can be achieved, compared with RRGS.

[0048] An N x N priority matrix is generated by rotating the sequence of SM neighbors in the same direction as the
message passing of the communication chain. This ensures that every SM has timely information about ports which have
already been scheduled. If an SM avoids choosing output ports which have already been picked by previous "visitors",
collisions can be completely avoided.
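
The rotation can be sketched in Python as follows. The function names and the particular rotation
`(row + col) mod N` are assumptions, chosen so that the result agrees with the example given for Fig. 7 (during the
second slot of the current frame, SM #3 works on the second slot and SM #1 on the fourth slot of the next frame);
the matrix actually drawn in Fig. 7 may differ.

```python
def priority_matrix(n):
    """Return an n x n matrix of SM numbers (1-based).

    Row r is a slot of the current frame, column c a slot of the next frame.
    Each row is the previous row rotated by one position (assumed rotation).
    """
    return [[(r + c) % n + 1 for c in range(n)] for r in range(n)]


def visit_order(matrix, sm):
    """Next-frame slots (1-based) visited by `sm`, one per current-frame slot."""
    return [row.index(sm) + 1 for row in matrix]


def is_valid(matrix):
    """Check that every row and every column contains each SM exactly once."""
    n = len(matrix)
    full = set(range(1, n + 1))
    rows_ok = all(set(row) == full for row in matrix)
    cols_ok = all({matrix[r][c] for r in range(n)} == full for c in range(n))
    return rows_ok and cols_ok


m4 = priority_matrix(4)
for row in m4:
    print(row)

# The same construction is used for an even and an odd number of SMs.
print(is_valid(m4), is_valid(priority_matrix(5)))
```

A row in which each SM appears once means that no two SMs work on the same future slot at the same time; a column in
which each SM appears once means that every SM visits every slot of the next frame once per frame. Under this assumed
rotation, `visit_order(m4, 1)` gives the slots 1, 4, 3, 2 of the next frame, a cyclic descending order of the kind
described for Fig. 8.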

[0049] As described above, by framing the time axis and using a priority matrix to reserve future time slots, a
systematically ordered sequence of visits to time slots can be obtained regardless of whether the number of SMs is
even or odd, resulting in easier implementation and control of SMs.

[0050] In a second embodiment, a Carry Over Round-robin Pipelined Scheduler (CORPS) provides a fair
scheduler for high speed crossbar fabrics and also solves the problems of the prior art scheduler. CORPS has
scalability properties regarding both the line speeds and the number of lines of a high speed switch fabric. For
scalability with the number of lines, a distributed architecture with message passing has been chosen. Similarly to
RRGS, a pipeline architecture is used in order to keep the message processing requirements scalable with line speeds.

[0051] In order to provide fairness, while maintaining the original distributed architecture and message passing
scheme, a carry over operation is introduced. The idea is that, when SM a visits a slot in which the desired output
port has been taken by an SM which has preceded a, SM a carries over the scheduling attempt for that port to the slot
N slots into the future from the slot just attempted. If that slot is also found to be taken for the same output
port, SM a goes another N slots further, until it finds a slot in which the desired output port has not yet been
taken. Fig. 11 illustrates the carry over operation.
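
A minimal Python sketch of the carry over operation is given here under simplifying assumptions: the reservations
for a single output port are held in a plain set of slot numbers, and neither blocking nor message passing is
modeled. The names `carry_over` and `taken_slots` are hypothetical.

```python
def carry_over(first_slot, taken_slots, n):
    """Return the slot finally reserved for the desired output port.

    first_slot  -- slot originally visited by the SM
    taken_slots -- set of slots in which that output port is already taken
    n           -- number of SMs, which is also the number of slots per frame
    """
    slot = first_slot
    while slot in taken_slots:
        slot += n          # carry the attempt over to the same position, one frame later
    taken_slots.add(slot)  # reserve the first free slot found
    return slot


N = 4
taken = set()
# Three SMs in turn request the same output port, all starting at slot 5.
reserved = [carry_over(5, taken, N) for _ in range(3)]
print(reserved)  # [5, 9, 13]
```

The first SM obtains the slot itself, while each later SM is pushed one further frame into the future, so every
colliding SM is eventually served.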

[0052] Carry over operations can spread through up to N frames, depending on the number of SMs "colliding" at a
given time slot. The slots affected by a carry over operation can be viewed as a set of slots taken to resolve the
collision. Notice that a slot taken in a carry over operation will be visited again in subsequent frames by all SMs. So
slots taken by carry over operations can potentially suffer new collisions, causing collision resolution sets to overlap.
This could lead to the need for N^2 frames to resolve multiple collisions, or N^3 slots in total.
[0053] In order to reduce the memory requirements of the system, as well as the scheduling delay, the number of
frames affected by carry over operations is limited by blocking a given SM, which has performed a carry over operation,
from attempting another scheduling for the same output port over the slots taken to resolve that particular collision. In
other words, a slot may not be used to resolve more than one collision at the same time. For instance, assume SM a
finds a slot m taken for a given port p, which triggers a carry over operation. Let m_x be the slot reserved by SM a as the
result of this carry over operation. Then, any of the slots m_n, 1 < n < x, becomes unavailable (blocked) for SM a, for the
same port p. Therefore, the blocking feature ensures that multiple collisions over a given slot are forbidden.

[0054] The CORPS scheduling algorithm will now be described. The messages passed in the communication
chain, as well as the SM database in which scheduling decisions are recorded, are described first. Subsequently, a flowchart
of the algorithm is described.

[0055] A vector of scheduling decision elements (SEs), to be passed from one SM to the next in the chain at every
cell slot, is defined. An S-message contains SEs with scheduling decisions made in the last N
cell slots at most. Thus, an S-message has at most N SEs. Each SE in an S-message has the following fields:

• Time To Live - TTL: First set to N by the SM which has generated that SE.

• Time Slot ID - TSI: The ID of the slot scheduled, defined as the number of slots from the current TS until the slot to
be scheduled.

• SM-ID: The ID of the input port Scheduling Module that has placed the scheduling reservation. 

• Output Port ID - OPI: The ID of the output port scheduled. 



[0056] At the beginning of a slot, each SM receives an S-message from its upstream neighbor, which contains the SEs
attached in the last N slots. Every SM makes at most one scheduling decision per time slot. If SM p makes a scheduling
decision, it creates a new SE with the following contents: SE-TTL = N; TSI = the number m of slots from the current one
up until (and including) the selected time slot; SM-ID = p; and SE-OPI = the desired output port q over which a packet
at time slot CTS + m is to be switched from input port p to output port q, where CTS denotes the current time slot.

[0057] Regardless of any scheduling decision, each SM must decrement the TTL of all other SEs in the S-message,
dropping the SEs for which TTL = 0 before passing the message to the next SM.
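The SE fields and the aging rule described above can be illustrated with a minimal Python sketch. The class and function names are hypothetical and are not part of the specification. The sketch also assumes, following task 101 of the flowchart description, that TSI is decremented together with TTL for every SE that survives, since TSI is expressed relative to the current slot.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SchedulingElement:
    ttl: int    # Time To Live, set to N by the SM that generated the SE
    tsi: int    # Time Slot ID: slots from the current slot to the reserved slot
    sm_id: int  # input port Scheduling Module that placed the reservation
    opi: int    # Output Port ID that was reserved


def age_s_message(message: List[SchedulingElement]) -> List[SchedulingElement]:
    """Decrement TTL (and TSI) of every received SE and drop expired SEs."""
    survivors = []
    for se in message:
        ttl = se.ttl - 1
        if ttl > 0:  # SEs whose TTL reaches 0 are removed from the S-message
            survivors.append(SchedulingElement(ttl, se.tsi - 1, se.sm_id, se.opi))
    return survivors
```

Because each SE starts with TTL = N and loses one unit per hop, an S-message aged this way never carries more than N SEs.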
[0058] Each SM has a memory array SC of (N + 1)N positions. The first N positions record the current frame
scheduling decisions, to be read by the crossbar switch module. These positions hold identical information about the
current frame among all SMs, and they can be accessed by the crossbar controller in several ways. Strictly speaking,
SMs do not need to keep this information. The remaining N^2 positions are used to record future scheduling decisions.
The memory array has the following fields:

• Time Slot ID: The index to the SC array. It gives the time slot ID for which the SC position holds scheduling
information. It is synchronized with the global clock provided by the crossbar module. This field wraps around as the
global clock progresses.

• Blockage: It defines a set of output ports for which the SM is blocked from attempting a scheduling reservation.
There can be up to N entries in this field. It is initially empty.

• Reservations: It records scheduling reservations for the given time slot. CORPS ensures that all entries in this field
for the current time slot (CTS) are identical across all SMs. Thus, the crossbar module can read the current
input/output scheduling of cells from any SM(CTS). A consistency check of the algorithm can be performed by the
crossbar module, by comparing this field among all SMs, if the crossbar controller has enough processing time.



[0059] Each SM follows the CORPS scheduling algorithm described in this section. CORPS does not put any
constraint on which output port a given SM should attempt to schedule. It is up to each SM to choose which output port it
wishes to schedule, following its own policy for serving its VOQs. The CORPS scheduling algorithm is depicted as a
flowchart in Fig. 14. Each task box in the figure is now addressed.



101. Receive S-message task: Receive the S-message from the previous SM. For each SE, decrement TTL. For a given
SE, if TTL > 0, decrement TSI and update SC at TSI. If TTL = 0, remove the SE from the S-message. Reset the
CARRY = FALSE flag (see task 109).

102. Compute slot to be attempted: Use the appropriate priority matrix to compute which Future Time Slot (FTS) to
attempt scheduling. For convenience, the matrix can be encoded in a function f of the form FTS = f(CTS, SM-ID).

103. Pick output port: Choose which output port (OPIS) the SM wishes to schedule for transmission. Notice that the
strategy of choosing the output port could depend on the result of the previous task. CORPS does not specify
this strategy (for instance, a weighted round robin choice of output ports could be used).

104. Do I own the slot test: Check if, among the RESERVATIONS entries of SC(FTS), any SM-ID is equal to that of
the SM executing the scheduling.

105. Am I blocked test: Check if, among the BLOCKAGE entries of SC(FTS), any OPI is equal to the output port
OPIS for which a scheduling is being attempted.

106. Pass S-message task: Pass the S-message to the next SM.

107. Is the slot taken test: Check if, among the RESERVATIONS entries of SC(FTS), any OPI is equal to the output
port OPIS for which a scheduling is being attempted.

108. Take the slot task: Make an entry in SC(FTS) RESERVATIONS with its own SM-ID, with OPI equal to OPIS;
create a SE with TTL = N, TSI = FTS, SM-ID equal to its own ID, and OPI = OPIS.

109. Carry over task: Test whether a carry over operation has been previously started by checking the
CARRY = TRUE/FALSE flag. If CARRY = TRUE, make an entry in the BLOCKAGE field of SC(FTS), with OPI = OPIS;
otherwise, set CARRY = TRUE. Set FTS = FTS + N.

110. Sanity check: FTS should never be more than 2N^2 positions away from CTS. If (FTS - CTS) > 2N^2, abort with
an error message.
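The reservation loop formed by tasks 104 to 110 can be sketched in Python as follows. This is an illustrative reading of the task descriptions, not the flowchart of Fig. 14 itself. It rests on several assumptions: the class and method names are hypothetical, slots are indexed by absolute number instead of a wrapping SC array, a slot already owned by the SM is treated like a taken slot (it triggers the carry over task), and the creation of the SE in task 108 and the message passing of task 106 are omitted.

```python
from collections import defaultdict


class SchedulingModule:
    """Hypothetical per-input-port scheduler state (the SC array of one SM)."""

    def __init__(self, sm_id, n_ports):
        self.sm_id = sm_id
        self.n = n_ports
        self.reservations = defaultdict(dict)  # slot -> {output port: owning SM}
        self.blockage = defaultdict(set)       # slot -> blocked output ports

    def record(self, slot, sm_id, opi):
        """Simplified task 101: store a reservation learned from an S-message."""
        self.reservations[slot][opi] = sm_id

    def try_schedule(self, cts, fts, opis):
        """Tasks 104-110: return the slot reserved for opis, or None if blocked."""
        carry = False
        while True:
            owns_slot = self.sm_id in self.reservations[fts].values()  # box 104
            if not owns_slot:
                if opis in self.blockage[fts]:                         # box 105
                    return None
                if opis not in self.reservations[fts]:                 # box 107
                    self.reservations[fts][opis] = self.sm_id          # box 108
                    return fts
            if carry:                                                  # box 109
                self.blockage[fts].add(opis)
            else:
                carry = True
            fts += self.n
            if fts - cts > 2 * self.n ** 2:                            # box 110
                raise RuntimeError("FTS is more than 2*N^2 slots away from CTS")
```

In this sketch the first colliding slot is not marked as blocked, while every further taken slot visited during the same carry over is, which mirrors the CARRY flag handling of task 109.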
[0060] The following benefits can be derived through CORPS:

[0061] No backlogged VOQ is starved, provided that it is eventually chosen by its SM. Assume a VOQ q is chosen
by SM p. According to Fig. 14, the only way that SM p comes out of the reservation loop without successfully scheduling
q is if it is blocked for the slot attempted. SM p being blocked means that queue q has been scheduled previously, and
the remark follows. The only other way out of the loop is through the sanity check, but that would imply that the carry
over operation did not find any available slot in the next N frames. Since there are at most N SMs involved in a collision,
and multiple collisions are forbidden by the blocking procedure, the loop should never be exited this way.
[0062] It is assumed that M is the set of m input ports (SMs) continuously attempting to schedule the same output
port q. Moreover, let n_i^q(Δt) be the number of slots scheduled by SM i for output port q during Δt time slots. A scheduler
is m-fair if, for any interval of time Δt and any i, j ∈ M, |n_i^q(Δt) - n_j^q(Δt)| < N. In other words, a SM is never able to be N
reservations ahead of any other SM.
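The m-fairness condition can be checked on a recorded schedule with a short Python sketch. The function name and the input representation (a list giving, for each slot, which SM holds output port q) are assumptions made for illustration only.

```python
def is_m_fair(owners, sm_ids, n_ports):
    """Check |n_i - n_j| < N over every interval of the recorded schedule.

    owners[t] is the SM holding output port q in slot t, or None if q is idle.
    sm_ids is the set M of SMs continuously competing for q.
    """
    for start in range(len(owners)):
        counts = {sm: 0 for sm in sm_ids}
        for t in range(start, len(owners)):
            if owners[t] in counts:
                counts[owners[t]] += 1
            # Any interval [start, t] in which one SM is N ahead breaks fairness.
            if max(counts.values()) - min(counts.values()) >= n_ports:
                return False
    return True
```

For example, with N = 2 the alternating schedule `[0, 1, 0, 1]` passes the check, while `[0, 0, 1, 1]` fails because SM 0 is two reservations ahead after the second slot.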

[0063] CORPS is m-fair, 1 < m < N. Suppose m SMs are colliding over an output port q at a given slot ts. Each of
the m colliding SMs is not blocked for that slot, for otherwise they would not even be able to test whether the slot had already
been taken (Fig. 14, box 105). If these m SMs are not blocked at slot ts, there must be m idle slots among the slots ts + nN,
1 < n < i, since the only way of accessing these slots into the future is through a carry over operation, and we know that
these SMs have not performed any carry over operation over these slots, otherwise they would be blocked at ts. This
means that, within the next i frames, each one of the colliding SMs will place a scheduling request for q. Now, since they
keep colliding for N consecutive slots of the current frame, and given that each collision generates one scheduling into
the next i frames per SM, each SM will total N slots reserved in the next i frames for output port q. Thus, any subset of
slots taken from the iN slots of the i frames over which the collision is being resolved cannot contain any SM with more
than N slots of advantage over any other SM.

[0064] The last remark is interesting because it means that a constantly backlogged VOQ can never get more than
N packets served ahead of the corresponding VOQ of another SM, no matter how long the measurement interval is. In
fact, in a long enough interval, all colliding SMs will get strictly the same number of reservations.
[0065] Additionally, under heavy load, queues with a common output port all have the same throughput, provided
that they are all chosen by their SMs the same number of times (Fig. 14, box 108).

[0066] A few comments about the CORPS architecture are due. The communication chain used to pass scheduling
information among the SMs can be used for changing the scheduling pattern of slots in any way desired, as long as it
is at least N slots into the future. For instance, output port reservations could also be withdrawn. This feature could be
useful for a SM which has just placed a reservation far into the future, due to a collision, and at the very next slot
realizes that the port required is now free. If the SM places another reservation (an earlier one) for the same packet, the far
away reservation may cause bandwidth wastage if not withdrawn. Reservation withdrawals, however, may adversely
affect the properties described above. For instance, if a SM that has collided withdraws its reservation later on, it will
adversely affect the delay of the packets scheduled after it in the same collision. In other words, a SM which has collided
with i - 1 other SMs and later erased its reservation due to this collision does not leave the system in the same state as
if only i - 1 SMs had collided in the first place. The present scheduler is as simple as possible, while still satisfying the initial
design goals. This ensures that the hardware required in an eventual implementation is kept simple.
[0067] CORPS resolves collisions by spreading the packet scheduling along multiple frames. Thus, it is reasonable
to expect that the average packet delay will be large, as compared with other schedulers. To that end, the performance of
CORPS under uniform traffic is analyzed. The ultimate goal is to evaluate how the carry over operation affects packet
delay, and the maximum utilization that can be obtained from the system, compared to competitive scheduling algorithms.
[0068] An analytical model for CORPS is developed, in order to assess the scheduler performance in terms of
packet delay versus traffic load. Two major assumptions are made below: i) a uniform traffic arrival process; and ii) random
VOQ queue selection by each SM (box 103 of Fig. 14), for the sake of simplicity.

[0069] A target VOQ queue Q_mn of a given SM m, destined to output port n, is defined. Packets arrive at every input
port according to a Bernoulli process, with intensity p. More specifically, in any given slot, a packet has probability p of
arriving at an input port. Moreover, every packet has equal probability of being destined to any of the outgoing ports
(assumption i). Thus, the packet arrival process at the target VOQ queue Q_mn has a Bernoulli distribution with parameter p/N.
[0070] Regarding VOQ selection, each non-empty queue of a given SM has equal probability of being selected for
scheduling (assumption ii). Therefore, for any VOQ, q is the probability of being selected, given that the VOQ in question
is non-empty. Here we follow Chipalkatti et al. ("Protocols for Optical Star-Coupler Network using WDM", IEEE Journal
on Selected Areas in Communications, Vol. 11, No. 4, May 1993). If ρ is the utilization of all VOQs, the expected
number of non-empty VOQ queues in a SM is given by 1 + (N - 1)ρ.

[0071] It is convenient to introduce another probability, closely related to q. Let r be the probability that any queue
is picked by its scheduler. r differs from q in the sense that q assumes that the queue in question is non-empty, while r
does not have this restriction. It is not difficult to see that:

r = ρq = p/N (1)
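The quantities in Equation 1 can be evaluated numerically with the following Python sketch. It assumes that q is approximated by the reciprocal of the expected number of non-empty VOQs, q = 1/(1 + (N - 1)ρ), which is an inference from the equal-probability selection rule and is not stated explicitly in the text. Under that assumption, setting ρq = p/N and solving for ρ gives ρ = p/(N - (N - 1)p). The function names are hypothetical.

```python
def selection_probabilities(rho: float, n_ports: int):
    """Return (q, r) for VOQ utilization rho, assuming q = 1 / E[non-empty VOQs]."""
    expected_non_empty = 1 + (n_ports - 1) * rho
    q = 1 / expected_non_empty   # chance a given non-empty VOQ is selected
    r = rho * q                  # unconditional chance the VOQ is picked
    return q, r


def utilization_from_load(p: float, n_ports: int) -> float:
    """Solve rho * q = p / N for rho under the same assumption on q."""
    return p / (n_ports - (n_ports - 1) * p)
```

As a consistency check, for p = 0.5 and N = 16 the sketch gives ρ = 0.5/8.5, and substituting that ρ back yields r = 0.03125 = p/N.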
[0072] The behavior of Q_mn can be modeled in the following way. Packet interarrival time clearly follows a Geometric
distribution, with parameter p/N. The head of line packet has to wait until it is selected by the SM, which happens
with probability q in a given slot. Once it is selected, there is a probability of being blocked from scheduling, according
to Fig. 14 (box 105). If P_b^m is the probability of being blocked for a port m at a given slot, the waiting time of a head of
line packet until it is picked by its SM follows a Geometric distribution, with parameter s = q × (1 - P_b). The superscript
may be dropped since the probability is the same for all output ports. Once Q_mn is selected, it is assumed that a reservation
is always placed in a future time slot, and the packet departs from the queue onto a type of conveyor belt, where
it awaits its reserved time slot to come by, at which time it departs from the system.

[0073] The overall model used for the Q_mn queuing system is shown in Fig. 11. An arriving packet first joins a
Geo(p/N)/Geo(s)/1 queue. Once it departs from the queue, it experiences an additional delay of D_corps, which is the
delay resulting from the particular way CORPS resolves collisions. This is modeled by a box with an infinite number of
servers.

[0074] The expected delay of a packet that goes through CORPS is given by the sum of the expected delay for a
Geo(p/N)/Geo(s)/1 queue plus the average delay D_corps. See M. J. Karol, M. G. Hluchyj, S. P. Morgan, "Input Versus Output
Queuing on a Space-Division Packet Switch", IEEE Transactions on Communications, Vol. COM-35, No. 12, pp. 1347-1356,
Dec. 1987. This may be written as:

„ 'capf (2) 

2N(l-f) 

where S is the random variable with Geo(s) time distribution. The computation of D_corps will now be discussed.
[0075] Once Q_mn is selected (the head of line packet departs from Geo(p/N)/Geo(s)/1), several events can take
place. First, SM m must make sure that it does not own the slot being attempted (box 104, Fig. 14). Let P_o^n be the
probability that a slot is owned by a SM to output port n. Additionally, let P_b^n be the probability that a given SM is blocked
for output port n at a given time slot. From this it can be derived:

pg =1+ ^ 1+( A/-n)(i.r) /s/ - 2[1 ' (1 " r f) ^ 1] ] (3) 


[0076] According to CORPS, a slot that is being visited by a SM can block the SM attempt to place a reservation 
only if this slot has been used to resolve a previous collision over the same output port. 
[0077] Consequently, the probability that a SM owns any port is: 

40 P 0 =H^Po) N (5) 

[0078] The expected delay D_o caused by box 104 of Fig. 14 is given by:



N Tp -pV+* 1 



(6) 



[0079] If the slot first visited by SM m is available (the tests of boxes 104, 105, and 107 of Fig. 14 all fail), it is easy to
see that the average delay D_corps of a packet is N, due to the priority matrix scheme used. If D_corps > N with no collision,
at least one reservation would spill over to the second frame into the future. Now the delay D_c incurred by collisions will
be examined. If a collision with i - 1 other SMs occurs over a particular slot, the delay D_c can vary from N up to iN,
depending on which priority order SM m has on that slot. Thus, let P[D_corps = jN | v = i] be the probability that the packet
delay is jN given that SM m is the i-th SM to visit the slot. For instance, if m is the first SM to visit the slot, then:

\[ P[D_{corps} = jN \mid v = 1] = \begin{cases} 1 & \text{if } j = 1 \\ 0 & \text{otherwise} \end{cases} \tag{7} \]



[0080] The previous expression simply states that, if SM m is the first to visit the slot, its packet will be delayed N
slots, since CORPS schedules one frame into the future if no collision occurs. Now recall that r is the probability that
the VOQ queue for any output port of SM s (s ≠ m) is non-empty and is picked by s, in particular for output port n. Then, it
is not difficult to see that a general expression for P[D_corps = jN | v = i] is:



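\[ P[D_{corps} = jN \mid v = i] = \begin{cases} 0 & \text{if } j > i \\ \binom{i-1}{j-1}\, r^{\,j-1} (1-r)^{\,i-j} & \text{otherwise} \end{cases} \tag{8} \]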



[0081] The first clause of Equation 8 states that if m is the i-th SM to visit the slot, its delay can be at most iN. The
binomial expression in the second clause states that, if i - 1 SMs have visited the slot prior to m, any j - 1 SMs among
these could be the ones colliding with m. The joint distribution of events of the form (D_corps = jN and v = i) can be easily
derived by multiplying the previous expression by 1/N, since SM m is equally likely to be the i-th visitor of a slot, 1 ≤ i ≤ N
(see Fig. 7).

[0082] The expected delay D_c of a packet can then be derived as:

\[ D_c = \frac{N(N-1)}{2}\, r + N \tag{9} \]
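The expected collision delay can be cross-checked numerically. The Python sketch below reads the conditional distribution of Equation 8 as a binomial term, P[delay = jN | v = i] = C(i-1, j-1) r^(j-1) (1-r)^(i-j) for j ≤ i, weights each visiting position i by 1/N, and sums the delays. The binomial reading and the closed form N + N(N-1)r/2 used for comparison are reconstructions from the surrounding description, and the function name is hypothetical.

```python
from math import comb


def expected_collision_delay(n_ports: int, r: float) -> float:
    """Average delay D_c in slots, summing j*N over the joint distribution."""
    total = 0.0
    for i in range(1, n_ports + 1):      # SM m is the i-th visitor w.p. 1/N
        for j in range(1, i + 1):        # j - 1 earlier visitors collided with m
            p_j_given_i = comb(i - 1, j - 1) * r ** (j - 1) * (1 - r) ** (i - j)
            total += (j * n_ports) * p_j_given_i / n_ports
    return total
```

For N = 16 and r = 0.05, the double sum evaluates to 22 slots, which agrees with 16 + 16 × 15 × 0.05 / 2.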



[0083] The total delay incurred by the CORPS scheduler is:

\[ D_{corps} = D_o + D_c \tag{10} \]

[0084] The last probability to be computed is P_b, the probability that a SM is blocked for a given output port n at a
given slot. It can be shown that:

^^^.(^^(^^.W^^.Q^)] (11) 




[0085] Noting that, for Geo(s), \bar{S} = 1/s and \overline{S(S-1)} = 2(1-s)/s^2, the total average delay a packet experiences
in the system is:



where s = q × (1 - P_b). The first three terms account for the delay in the VOQ queue, before the scheduling takes place.
The last term accounts for the extra time the packet needs to wait due to the pipeline and collision resolution features
of CORPS.
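The two moments of the Geo(s) service time used above follow from a short derivation, sketched here under the assumption that S takes values 1, 2, ... with P(S = k) = s(1-s)^{k-1}:

```latex
\begin{align*}
\bar{S} &= \sum_{k \ge 1} k\, s (1-s)^{k-1}
         = s \cdot \frac{1}{\bigl(1-(1-s)\bigr)^{2}} = \frac{1}{s}, \\
\overline{S(S-1)} &= \sum_{k \ge 1} k(k-1)\, s (1-s)^{k-1}
         = s(1-s) \sum_{k \ge 2} k(k-1)(1-s)^{k-2}
         = s(1-s) \cdot \frac{2}{s^{3}} = \frac{2(1-s)}{s^{2}}.
\end{align*}
```

Both sums use the standard power series identities for 1/(1-x)^2 and 2/(1-x)^3 evaluated at x = 1 - s.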
[0086] In Fig. 16, a comparison of CORPS delay versus throughput analytical results against a simulation of a 16
x 16 switch equipped with the CORPS scheduler is shown. In the chart, there is a distinction between the average queuing
delay a packet experiences until it gets picked by the SM scheduler, and the CORPS delay, due to the pipeline and
collision resolution techniques used. As can be seen, the analytical predictions match the behavior of the
simulated system quite well.

[0087] The charts show that the scheduling delay dominates over queuing delay throughout the whole range of
loads. Only for very high loads, when the queue starts building up, does the queuing delay become significant. This
means that CORPS is doing a fine job in scheduling packets into the future as soon as they arrive at their VOQ queues.
On the other hand, the mean delay incurred by CORPS grows from roughly one frame, under light loads, to about 5
frames, when the load reaches 0.85.

[0088] For completeness, Fig. 17 shows the complementary distribution of total delay in a 16 x 16 CORPS switch.
The curves are for loads of 0.8 and 0.85, obtained by simulation. First, it can be noticed that no packet takes more than
N^2 slots to get through the system. This is due to the fact that CORPS does not allow multiple collisions to occur. In fact,
the tail of the distribution seems to end somewhere near N^2/2 = 128. It is likely, though, that a packet delay of N^2 would be
approached if the system were driven by extremely large loads.

[0089] Fig. 18 illustrates, in block diagram, an implementation of CORPS. A VOQM module enqueues packets in
virtual output queues. This module also makes requests on behalf of a given queue to the SM module. The SM module
controls the message passing and implements the CORPS scheduler. It communicates back with its VOQM to inform it
of a future slot reservation. This information is kept at the VOQM, so that at a given slot, a packet is transferred to the
crossbar register to be switched.

[0090] In the figure, the communication between the SMs and the crossbar controller is shown to take place over a 
bus, although this particular type of communication is not necessary. 

[0091] A fair comparison among scheduling algorithms should take into account not only performance measures,
such as average delay and throughput, but also complexity and implementation costs. The first criterion for selection is
high throughput. Moreover, only schedulers which operate with VOQ are compared. Thus, the following competitive
schedulers are compared with the present invention: 1-SLIP and RRGS. The reason for selecting 1 iteration SLIP, as
opposed to several iterations, is fairness in the comparison process. That is, it is assumed that at most one decision per
slot can be made at any input port. i-SLIP, i > 1, would effectively require more than one scheduling decision per slot.
[0092] For performance comparison, both analytical and simulation results have been relied upon. The delay
performance of RRGS and SLIP, for uniform traffic, can be approximated by:






■ pQ-q) 1 N 



(14) 



[0093] For RRGS results, see A. Smiljanic, R. Fan, G. Ramamurthy, "RRGS - Round-Robin Greedy Scheduling for
Electronic/Optical Terabit Switches", NEC C&C Research Laboratories, Technical Report TR 98-C063-4-5083-2, 1998,
and for SLIP results, see N. McKeown, "Scheduling Cells in an Input-Queued Switch", PhD Thesis, University of
California at Berkeley, 1995. Fig. 19 depicts the average delay versus throughput performance of these algorithms, against
CORPS.

[0094] From the figure, it is evident that RRGS and CORPS are able to sustain much higher loads than SLIP, before
delays become significant. One can easily see that the derivative of these curves, for high load, is significantly smaller
for RRGS and CORPS. However, both algorithms carry an offset delay budget for medium to light loads. For RRGS, this
is exclusively due to the pipeline technique used. For CORPS, the additional delay is due to collision resolution, as
addressed in the previous section. However, CORPS has two advantages over RRGS: i) It allows freedom of choice
about which output port a SM should pick; ii) It is a strictly fair scheduler. SLIP is also a fair scheduler, although its
collision resolution process is entirely different from that of CORPS.

[0095] As mentioned before, CORPS gives complete freedom of choice as to which output port a scheduling attempt
should target. Namely, each VOQM is free to choose any output port to be scheduled on behalf of a given VOQ. This
fact was an important part of the scheduler design strategy. Therefore, many algorithms can be used for VOQ selection,
in conjunction with CORPS. So far, one such algorithm, namely, random selection among non-empty VOQs, has been
discussed. Other examples of VOQ selection strategies are possible. The VOQ selection strategies may be classified
into two classes: cooperative and non-cooperative selection strategies.
[0096] Non-cooperative VOQ selection strategies are those for which a VOQ selection decision is made per input
port (VOQM), independently from other input ports. The random selection strategy used in the analysis of CORPS
belongs to this class.

[0097] Weighted Fair Queuing (WFQ) is a popular service strategy in the packet switching research literature. See,
for example, H. Zhang, "Service Disciplines for Guaranteed Performance Service in Packet-Switching Networks", In
Proceedings of IEEE, Vol. 83, no. 10, pp. 1374-1396, Oct. 1995. The idea is to regulate service rates of various queues
competing for an output link capacity, according to predefined weights. In a VOQ CORPS switch, an output port
bandwidth can be split among various VOQMs, by some sort of a Call Admission Controller. WFQ then can be used to
enforce that the maximum service rate of a VOQ queue does not exceed the VOQM bandwidth share of a given output
port.

[0098] The Rate-Controlled Service (RCS) discipline assumes that a given traffic flow satisfies certain burstiness
constraints at the network entry point. See L. Georgiadis, R. Guerin, V. Peris, "Efficient Network QoS Provisioning Based
on per Node Traffic Shaping," Proceedings of INFOCOM96, vol. 1, pp. 102-110, 1996. These constraints are typically
enforced by a traffic shaper at the edge of the network. Moreover, traffic shapers are also placed at intermediate
switches, so that the traffic can be brought back to comply with such constraints at each and every intermediate
switching point in the network. A traffic shaper is typically implemented by a leaky bucket algorithm. J. Turner, "New Directions
in Communications, or Which Way to the Information Age?", IEEE Communication Magazine, Vol. 24, pp. 8-15, 1986
describes one such algorithm. A basic leaky bucket is a queuing system with two queues, one for data, and one for
tokens, or permits. A data packet on the queue needs a permit to be eligible for service. Only a limited number of
permits are stored at the permit queue. Permits are generated at a constant rate. A traffic shaper of this kind could be used
to regulate which among the VOQs are eligible for service. Among the eligible ones, any algorithm could be used for
queue selection.
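The permit mechanism described above can be sketched in Python as follows. The class name, the per-slot `tick` interface, and the choice to start with a full permit queue are assumptions for illustration; the cited leaky bucket algorithms may differ in detail.

```python
class LeakyBucket:
    """Hypothetical per-VOQ shaper: a bounded permit queue refilled at a constant rate."""

    def __init__(self, permit_rate: float, max_permits: int):
        self.permit_rate = permit_rate      # permits generated per time slot
        self.max_permits = max_permits      # capacity of the permit queue
        self.permits = float(max_permits)   # assumption: start with a full queue

    def tick(self) -> None:
        """Advance one slot: generate permits, discarding any overflow."""
        self.permits = min(self.max_permits, self.permits + self.permit_rate)

    def eligible(self) -> bool:
        """A head-of-line packet is eligible only if a whole permit is available."""
        return self.permits >= 1.0

    def consume(self) -> None:
        """Spend one permit when the VOQ is actually selected for scheduling."""
        if not self.eligible():
            raise ValueError("no permit available")
        self.permits -= 1.0
```

A VOQM could call `tick()` on every bucket once per slot and offer to its SM only those backlogged VOQs whose bucket reports `eligible()`.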

[0099] The two service disciplines described above may be used for the support of Quality of Service (QoS) in
packet networks, in itself an active research area. Such a QoS supportive strategy is likely to be of a non-cooperative
type, as it is supposed to ensure a predicted service behavior of its VOQs, regardless of other cross traffic streams.
Algorithms belonging to this class could be used in switches supporting stringent QoS applications, such as
video and voice streams.

[0100] Cooperative VOQ selection strategies are those for which a VOQ selection depends on the state of the
entire set of VOQs in the switch. These strategies typically aim at providing a good overall switch behavior, such as
maximizing throughput, rather than concentrating on the service of each flow. Thus, these strategies are used in
switches supporting data traffic, with no commitment to any QoS requirements.

[0101] For cooperative strategies, additional information needs to be provided to the CORPS scheduler, such as 
the state of other VOQs. The information about the state of the queues is always 'stale', so the service strategy must 
be robust with regard to stale information. 

[0102] A maximum matching problem consists of finding, among the edges of a given graph, a subset of edges which
"pairs" together the vertices of the graph, in a way that maximizes the total number of pairs. See Cormen, Leiserson and
Rivest, "Introduction to Algorithms", McGraw-Hill, 1990. However, no vertex can have more than one selected edge
attached to it. If the number of packets switched at every slot is to be maximized, a Maximum Bipartite Matching (MBM)
problem needs to be solved. See R. E. Tarjan, "Data Structures and Network Algorithms", Society for Industrial and Applied
Mathematics, Pennsylvania, Nov. 1983. Algorithms for solving the MBM are available, with reasonable computational
complexity. See J. E. Hopcroft, R. M. Karp, "An n^{5/2} Algorithm for Maximum Matching in Bipartite Graphs", Society for
Industrial and Applied Mathematics J. Comput. 2 (1973), pp. 225-231. In the present invention, VOQ empty/non-empty
state information can be sent through the communication chain, and passed to the VOQMs, where a MBM
algorithm would decide which queue to serve in a slot of the next frame. Interestingly enough, CORPS allows queues not
selected by the MBM algorithm to also attempt a reservation into the future.
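A maximum bipartite matching between input ports and output ports can be computed from the empty/non-empty VOQ state with the simple augmenting path method sketched below in Python. This is not the Hopcroft-Karp algorithm cited above, only a shorter and slower alternative chosen for readability; the function name and input format are hypothetical.

```python
def maximum_bipartite_matching(non_empty):
    """Match inputs to outputs so that as many packets as possible are switched.

    non_empty[i] is the set of output ports for which input i has a backlogged VOQ.
    Returns a dict mapping each matched input port to its output port.
    """
    match_of_output = {}  # output port -> input port currently matched to it

    def try_assign(i, visited):
        for out in sorted(non_empty[i]):
            if out in visited:
                continue
            visited.add(out)
            # Take a free output, or re-route the input that currently holds it.
            if out not in match_of_output or try_assign(match_of_output[out], visited):
                match_of_output[out] = i
                return True
        return False

    for i in range(len(non_empty)):
        try_assign(i, set())
    return {i: out for out, i in match_of_output.items()}
```

For instance, with `non_empty = [{0, 1}, {0}, {1, 2}]` all three inputs can be matched, because input 0 is moved from output 0 to output 1 to make room for input 1.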

[0103] A Maximum Weight Bipartite Matching (MWBM) problem is similar to the MBM problem just described. The
major difference is that, in the former, weights are associated with the edges of a graph, and the objective is to find a
set of edges which maximizes the sum of the edge weights of the matching. Others have used MWBM algorithms to
show that, under nonuniform traffic, they outperform MBM strategies in terms of throughput. See N. McKeown, V.
Anantharam, J. Walrand, "Achieving 100% Throughput in an Input-Queued Switch", Proceedings of Infocom96, San
Francisco, March 1996. The idea is to use VOQ queue sizes as weights, in order to handle non-uniform traffic
scenarios.
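The use of VOQ queue sizes as weights can be illustrated with a brute-force Python sketch that tries every input-to-output permutation. Exhaustive search is an assumption made for clarity and is only practical for very small N; it is not the algorithm of the cited reference. The function name is hypothetical.

```python
from itertools import permutations


def max_weight_matching(queue_len):
    """Return perm where input i is matched to output perm[i], maximizing total weight.

    queue_len[i][j] is the occupancy of the VOQ at input i destined to output j.
    """
    n = len(queue_len)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(queue_len[i][perm[i]] for i in range(n)),
    )
    return list(best)
```

With `queue_len = [[5, 1], [9, 0]]` the sketch selects input 0 to output 1 and input 1 to output 0 (total weight 10), favoring the longest queue even though input 0 has a longer queue for output 0.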

[0104] The above reference also shows that MWBM algorithms are stable, i.e., VOQ queues never blow up, as long
as the input traffic is admissible. A traffic is admissible if the sum of the input traffic rates towards a single output port
does not exceed its capacity, for every output port. An interesting result of this is that the stability of the MWBM is
maintained, even in the presence of stale information, that is, when the weights are based on queue levels of some number of time
slots in the past. Thus, again, VOQ queue level information can be passed to all VOQMs, so that a MWBM algorithm is
run at each module, before a request for an output port is issued to the SMs.
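The admissibility condition as stated above (per output port only) can be checked with a few lines of Python. The function name, the rate matrix layout, and the default capacity of one cell per slot are assumptions for illustration.

```python
def is_admissible(rates, capacity=1.0):
    """True if, for every output port, the summed input rates do not exceed capacity.

    rates[i][j] is the arrival rate from input i to output j, in cells per slot.
    """
    n_outputs = len(rates[0])
    return all(
        sum(row[j] for row in rates) <= capacity
        for j in range(n_outputs)
    )
```

For example, `[[0.5, 0.2], [0.4, 0.3]]` is admissible (column sums 0.9 and 0.5), while `[[0.7, 0.1], [0.6, 0.1]]` is not, since output 0 would be offered 1.3 cells per slot.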
[0105] Additional advantages and modifications will readily occur to those skilled in the art. Therefore, the invention
in its broader aspects is not limited to the specific details shown and described herein. Accordingly, various
modifications may be made without departing from the spirit or scope of the general inventive concept as defined by the
appended claims and their equivalents.

Claims

1. A switch for controlling a flow of data in a network, comprising:

input ports;

output ports; and

a scheduler having a plurality of input port schedule modules, to schedule a particular input port of said input
ports for sending data to a designated output port of said output ports,

wherein a current schedule module receives a scheduling message from a previous schedule module,
computes a future time slot for which said current schedule module will attempt to access said designated output
port, determines if said future time slot is valid based on whether said future time slot is currently reserved by
said current schedule module, whether said future time slot is blocked and whether said future time slot is
taken by another schedule module and takes said future time slot if valid and enters information into said
scheduling message indicating that said future time slot is taken.

2. A switch as claimed in claim 1, wherein said scheduler advances said future time slot by a predetermined number
of time slots when said future time slot has been reserved or taken.

3. A switch as claimed in claim 1, wherein said data input through said input ports is queued using virtual output
queuing that maintains separate queues for each of said output ports.

4. A switch as claimed in claim 3, wherein said virtual output queuing for a particular port is independent of said virtual
output queuing for the other ports.

5. A switch as claimed in claim 3, wherein service rates of said virtual output queuing are both predictable and
adjustable.

6. A switch as claimed in claim 1, wherein said scheduler selects said designated output port based on a weighted
round robin.

7. A method of scheduling input signals arriving at input ports of a switch to be sent to output ports of said switch
having a plurality of input port schedule modules, comprising the steps of:

a) receiving a scheduling message from a previous schedule module by a current schedule module;

b) computing a future time slot for which said current schedule module will attempt to access one of said output
ports;

c) selecting one of said output ports to schedule for transmission at said future time slot;

d) determining whether said future time slot has been previously reserved by said current scheduling module;

e) determining whether said future time slot is blocked, when said future time slot has not been previously
reserved;

f) determining whether said future time slot was previously taken by another schedule module, when said
future time slot is not blocked;

g) determining whether a carry over operation was previously started from said scheduling message, when
said future time slot is taken by another schedule module or has been previously reserved by said current
scheduling module;

h) setting said future time slot to be blocked and returning to step d), when said carry over operation was
previously started;

i) advancing said future time slot by a predetermined number of time slots, setting a carry over flag and
returning to step d), when said carry over operation was not previously started;

j) taking said future time slot and entering information into said scheduling message indicating that said future
time slot is taken, when said future time slot has not been previously taken by another schedule module; and

k) passing said scheduling message to a next schedule module.
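
The decision sequence of steps d) to j) can be illustrated by the following minimal sketch. It is one possible reading only, not the implementation of the disclosure. The names `ScheduleModule`, `attempt`, `msg` and `advance` are hypothetical. The sketch assumes that a slot counts as "taken" per designated output port, and that finding or setting a blocked slot ends the attempt for the current message, which the claim wording does not state explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class ScheduleModule:
    sm_id: int
    reserved: set = field(default_factory=set)  # slots this module already holds
    blocked: set = field(default_factory=set)   # slots this module may not use


def attempt(sm, msg, slot, port, advance):
    """Try to reserve `port` at `slot`; `msg` maps slot -> {port: sm_id}.

    Returns the slot that was taken, or None if no slot was taken.
    """
    carried = False                               # carry over flag
    while True:
        if slot in sm.reserved:                   # step d)
            conflict = True
        elif slot in sm.blocked:                  # step e)
            return None                           # assumption: give up
        else:
            conflict = port in msg.get(slot, {})  # step f)
        if not conflict:                          # step j)
            msg.setdefault(slot, {})[port] = sm.sm_id
            sm.reserved.add(slot)
            return slot
        if carried:                               # steps g) and h)
            sm.blocked.add(slot)
            return None                           # assumption: attempt ends
        slot += advance                           # step i)
        carried = True
```

With slot 5 already taken for output port 2 and `advance` set to 4, a call such as `attempt(ScheduleModule(1), {5: {2: 0}}, 5, 2, 4)` carries over once and takes slot 9. A second module meeting both slots taken marks slot 9 as blocked and takes nothing.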

8. A method of scheduling as claimed in claim 7, wherein said data input through said input ports is queued using vir- 
tual output queuing that maintains separate queues for each of said output ports. 

9. A method of scheduling as claimed in claim 8, wherein said virtual output queuing for a particular port is
independent of said virtual output queuing for the other ports.

10. A method of scheduling as claimed in claim 8, wherein service rates of said virtual output queuing are both predict- 
able and adjustable. 

11. A method of scheduling as claimed in claim 7, wherein said scheduler selects said designated output port based 
on a weighted round robin. 

12. A switch for controlling a flow of data in a network, comprising: 

a plurality of input ports; 

a plurality of output ports; and 

a scheduler having a plurality of input port schedule modules, to schedule a particular input port of said input 
ports for sending data to a designated output port of said output ports, 
wherein the schedule modules are connected in a ring and, at each time slot, each of the schedule modules
receives reservation information from a previous schedule module, determines whether a future time slot is
permitted to be reserved for said schedule module to send data, and sends reservation information including
its own reservation for a future time slot to a next scheduler module.
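
The ring communication of claim 12 can be pictured with the short sketch below. It is an illustration under assumptions: `ring_step` is a hypothetical name, reservation information is modelled as a plain list, and the decision whether a slot may be reserved is left to the caller.

```python
def ring_step(messages, reservations):
    """One time slot on a ring of N schedule modules.

    messages[i]     -- reservation information currently held by module i
    reservations[i] -- reservation module i adds in this slot, or None
    Returns the information held by each module after every module has
    sent its updated information to the next module on the ring.
    """
    n = len(messages)
    updated = [msg + ([res] if res is not None else [])
               for msg, res in zip(messages, reservations)]
    # Module (i + 1) % n receives what module i sends.
    return [updated[(i - 1) % n] for i in range(n)]
```

For three modules starting with empty information, `ring_step([[], [], []], ["a", None, "c"])` leaves module 0 holding `["c"]`, module 1 holding `["a"]` and module 2 holding nothing.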

13. A method for scheduling input signals arriving at input ports of a switch to be sent to output ports of said switch
having N input port schedule modules, comprising the steps of:

a) setting a sequence of frames, each of the frames consisting of N time slots; and

b) scheduling the input signals in a current frame so that the input signals are sent to the output ports in a next
frame following the current frame.

14. A method according to claim 13, wherein the step b) comprises the steps of: 

b.1) receiving a scheduling message from a previous schedule module by a current schedule module;

b.2) computing a future time slot for which said current schedule module will attempt to access one of said
output ports, wherein said future time slot is included in the next frame;

b.3) selecting one of said output ports to schedule for transmission at said future time slot;

b.4) determining whether said future time slot has been previously reserved by another scheduling module;

b.5) taking said future time slot and entering information into said scheduling message indicating that said
future time slot is taken, when said future time slot has not been previously taken by another schedule module; and

b.6) passing said scheduling message to a next schedule module.

15. A method according to claim 13, wherein the step b) comprises the steps of:

simultaneously starting scheduling decision processes of the N input port schedule modules at the beginning
of each frame;

simultaneously performing the scheduling decision processes using a pipelined approach in said frame; and

simultaneously completing the scheduling decision processes at the end of said frame.

16. A method according to claim 13, wherein in the step b), scheduling decision processes of the N input port schedule
modules are simultaneously performed in said current frame, wherein the N input port schedule modules make
scheduling decisions for different time slots of said next frame.
17. A method according to claim 13, wherein in the step b), the input signals in said current frame are scheduled to be
sent to the output ports in said next frame by referring to an N x N matrix which defines an ordered sequence of the
N input port schedule modules to visit a given time slot in the future.
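
The frame-based scheduling of claims 13 to 17 can be illustrated by the sketch below. The matrix used here is a plain cyclic rotation, which is an assumption chosen for simplicity and not necessarily the N x N matrix of the disclosure. The function names `visit_order` and `next_frame_slot` are hypothetical.

```python
def visit_order(n):
    """Return an n x n matrix of next-frame slot indices.

    Row k gives, for step k of the current frame, the slot of the next
    frame that each of the n schedule modules attempts. Module i is
    assumed to attempt slot (i + k) % n.
    """
    return [[(i + k) % n for i in range(n)] for k in range(n)]


def next_frame_slot(frame_index, n, slot_in_frame):
    """Absolute slot number of `slot_in_frame` in the frame that follows
    frame `frame_index`, with frames of n slots numbered from 0."""
    return (frame_index + 1) * n + slot_in_frame
```

In this rotation every row is a permutation of the slots, so the modules decide for different slots of the next frame at each step, and every column is also a permutation, so each module visits every slot of the next frame exactly once per frame. The same holds for even and odd values of `n`.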



EP 1 061 763 A2 



FIG. 1 (diagram; label: CONTROL)

FIGS. 2A and 2B (diagrams; panel labels: ARBITER ARCHITECTURE; SINGLE BLOCK PER CROSS POINT ARCHITECTURE)

FIG. 3

FIG. 4 (diagram; label: CROSSBAR)

FIG. 5 (diagram; label: TIME SLOT)

FIG. 6 (table; rows: SCHEDULER MODULES, columns: TIME SLOT)

TIME SLOT   T1   T2   T3   T4   T5   T6   T7   T8   T9   T10  T11
SM1         T6   T4   T7   T5   T10  T8   T11  T9   T14  T12  T15
SM2         T3   T6   T4   T9   T7   T10  T8   T13  T11  T14  T12
SM3         T5   T3   T8   T6   T9   T7   T12  T10  T13  T11  T16
SM4         T2   T7   T5   T8   T6   T11  T9   T12  T10  T15  T13

FIG. 9 (diagram; axes: CURRENT FRAME SLOT 1 to 5 and NEXT FRAME SLOT 1 to 5)

FIG. 10 (table of time slot entries; column groups: FRAME F1, FRAME F2, FRAME F3, FRAME F4)

FIG. 12 (diagram; repeated element labelled SE with fields: TIME TO LIVE (TTL), TIME SLOT ID, SM-ID, OUTPUT PORT ID)

FIG. 13 (diagram; repeated entry with fields: TIME SLOT ID, BLOCKAGE (OPI entries), RESERVATIONS (SM-ID entries))

FIG. 14 (flowchart titled CORPS SCHEDULING; box labels: GET S-MSG, UPDATE SC, DROP TTL=0 ELEMS; COMPUTE SLOT TO BE ATTEMPTED; PICK OUTPUT PORT P; TAKE THE SLOT, INCLUDE RESERVATION INTO S-MSG; EXECUTE CARRY OVER OPERATION; SANITY CHECK (<2N*); PASS S-MSG; ABORT; branch labels: OK, NOK; reference numerals: 101, 102, 103, 106, 108, 110)

FIG. 15 (diagram; label: CORPS)

FIG. 16 (plot titled CORPS - SIMULATION VS ANALYSIS - N=16; horizontal axis: SYSTEM LOAD; curves: Geo/Geo/1-ANA, CORPS-ANA, TOTAL-ANA, QUEUEING DELAY-SIM, CORPS-SIM, TOTAL-SIM)

FIG. 17 (plot; horizontal axis: DELAY, 0 to 140)

FIG. 18 (block diagram; labels: VOQM, REGISTER, SM, CROSSBAR CONTROLLER)

FIG. 19 (plot titled SCHEDULERS - SIMULATION VS ANALYSIS - N=16; horizontal axis: SYSTEM LOAD)